Factor and Categorical Matrices

In this lecture we will discuss the factor() function and its use for creating categorical matrices. This function will become extremely useful when we begin to apply data analysis and machine learning techniques to our data. The idea is sometimes also known as creating dummy variables.

Let's start with an example of why and how we build this matrix. Imagine we have the following vectors representing data from an animal sanctuary for dogs ('d') and cats ('c'), where each animal has a corresponding id number in another vector.

In [6]:
animal <- c('d','c','d','c','c')
id <- c(1,2,3,4,5)

We want to convert the animal vector into information that an algorithm or equation can understand more easily. The first step is to check how many categories (factor levels) are in our character vector.

We can pass the vector through the factor() function like so to get this information:

In [9]:
factor.ani <- factor(animal)
In [10]:
# Will show levels as well on RStudio or R Console
factor.ani
Out[10]:
  1. d
  2. c
  3. d
  4. c
  5. c
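
Since the notebook output above does not print the levels, here is a minimal sketch of how they could be inspected directly. It rebuilds the same animal vector so it runs on its own; levels(), nlevels() and as.integer() are base R functions. Note that, by default, factor() sorts the levels alphabetically, so 'c' comes before 'd' regardless of the order in which they appear in the data.

```r
# Same data as above, repeated so this cell is self-contained
animal <- c('d','c','d','c','c')
factor.ani <- factor(animal)

# The distinct categories, sorted alphabetically by default
levels(factor.ani)       # "c" "d"

# How many categories there are
nlevels(factor.ani)      # 2

# The integer codes R stores internally (position of each value in levels)
as.integer(factor.ani)   # 2 1 2 1 1
```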

We can see that we have two levels, 'd' and 'c'. In R there are two distinct types of categorical variables: an ordinal categorical variable and a nominal categorical variable.

Nominal categorical variables don't have any order, such as dogs and cats (there is no order to them). Ordinal categorical variables, as the name implies, do have an order. For example, if you had the vector:

In [13]:
ord.cat <- c('cold','med','hot')

You could assign an order to these values, such as:

cold < med < hot

If you want to assign an order while using the factor() function, you can pass in the argument ordered=TRUE and then pass the levels= argument a vector listing the levels in the order you want. So for example:

In [20]:
temps <- c('cold','med','cold','med','hot','hot','cold')
fact.temp <- factor(temps,ordered=TRUE,levels=c('cold','med','hot'))
fact.temp
Out[20]:
  1. cold
  2. med
  3. cold
  4. med
  5. hot
  6. hot
  7. cold
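
As a rough illustration of what the ordered factor gives us, the sketch below rebuilds the same temps data and checks the level order. It uses only base R; is.ordered() and the comparison operator on ordered factors are standard, and the last line is included only to contrast with the alphabetical order we would get without the levels= argument.

```r
# Same data as above, repeated so this cell is self-contained
temps <- c('cold','med','cold','med','hot','hot','cold')
fact.temp <- factor(temps, ordered=TRUE, levels=c('cold','med','hot'))

# Levels are kept in the order we supplied: cold < med < hot
levels(fact.temp)        # "cold" "med" "hot"
is.ordered(fact.temp)    # TRUE

# Ordered factors support comparisons against a level
fact.temp < 'hot'        # TRUE TRUE TRUE TRUE FALSE FALSE TRUE

# Without levels=, the default order would be alphabetical
levels(factor(temps))    # "cold" "hot" "med"
```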

This information is useful together with the summary() function, which is a convenient way to quickly get information from a matrix or vector. For example, compare the summary of the raw character vector with the summary of the factor, which counts the observations in each level:

In [21]:
summary(temps)
Out[21]:
   Length     Class      Mode 
        7 character character 
In [22]:
summary(fact.temp)
Out[22]:
cold  med  hot 
   3    2    2 
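
To connect back to the dummy variables mentioned at the start, here is a minimal sketch of one way a factor could be turned into a categorical (indicator) matrix. It assumes the animal and id vectors from earlier and uses model.matrix() from the stats package that ships with R; the data frame name df and the variable name dummies are just illustrative. The `- 1` in the formula drops the intercept so that every level gets its own column.

```r
# Same data as earlier, repeated so this cell is self-contained
animal <- c('d','c','d','c','c')
id <- c(1,2,3,4,5)

# Illustrative data frame holding the id and the factor version of animal
df <- data.frame(id = id, animal = factor(animal))

# One 0/1 column per level; "- 1" removes the intercept column
dummies <- model.matrix(~ animal - 1, data = df)
dummies
# Columns are named animalc and animald; row 1 (a dog) has animald = 1
```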

Later on we will revisit the factor() function and its ordered argument. For now, that is all we need to know.